Notes:
getwd()
## [1] "/Users/webstrum/Study/old/udacity/docs/rcode/P07M01L09"
setwd("/Users/webstrum/Study/old/udacity/docs/rcode/P07M01L09")
In this lesson, Solomon will be using a linear regression model to predict diamond price using other variables in the diamonds dataset.
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is
# reasonable or a rip-off.
# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).
# Limit the x-axis and y-axis to omit the top 1% of values.
library("ggplot2")
data("diamonds")
qplot(data = diamonds, x = carat, y = price,
xlim = c(0, quantile(diamonds$carat, 0.99)),
ylim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = "orange", color = "black", shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 926 rows containing missing values (geom_point).
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(fill = "blue", color = "black", shape = 21) +
coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)), ylim = c(0, quantile(diamonds$price, 0.99)))
Response: We can see a nonlinear relationship. Maybe it’s exponential or maybe it’s something else. We can see that the dispersion or variance of the relationship also increases as carat size increases.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(fill = "blue", color = "black", shape = 21) +
coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)), ylim = c(0, quantile(diamonds$price, 0.99))) +
stat_smooth(method = "lm")
Notes:
Frances Gerety coined a famous slogan complete that slogan. > A diamonds is forever.
Notes:
The Counte of Monte Cristo and Knocked Up are two movies that have proposals without engagement rings.
Notes:
# install these if necessary
# install.packages('GGally')
# install.packages('scales')
# install.packages('memisc')
# install.packages('lattice')
# install.packages('MASS')
# install.packages('car')
# install.packages('reshape')
# install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(lattice)
library(MASS)
library(scales)
library(memisc)
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following object is masked from 'package:ggplot2':
##
## syms
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, progress = FALSE,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things you notice in the ggpairs output?
Response:
We can see what might be relationships between price and clarity and price and color, which we’ll keep in mind for later when start modelling our data.
Notes:
# Create two histograms of the price variable
# and place them side by side on one output image.
# We’ve put some code below to get you started.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
# You can self-assess your work with the plots
# in the solution video.
library(gridExtra)
plot1 <- qplot(data = diamond_samp, x = price, binwidth = 100, fill = I("#099DD9")) +
ggtitle('Price')
plot2 <- qplot(data = diamond_samp, x = price, binwidth = 0.01, fill = I("#F79420")) +
ggtitle('Price (log10)') +
scale_x_log10()
grid.arrange(plot1, plot2, ncol = 2)
Notes: Indeed we can see that the prices for diamonds are pretty heavily skewed, but when you put those prices on a log ten scale, they seem much better behaved. They’re much closer to the bell curve of a normal distribution. We can even see a little bit of evidence of bimodality on this log ten scale. Which is consistent with our two class rich buyer poor buyer speculation about the nature of customers for diamonds. You actually saw a sneak peek of this in problem set set three when you looked at the log ten of price by cut.
qplot(carat, price, data = diamonds) +
scale_y_continuous(trans = log10_trans()) +
ggtitle("Price (log10) by Carat")
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
head(sort(table(diamonds$carat), decreasing = T))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 0.5, size = 0.75, position = "jitter") +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Notes
No single diamond is perfect for everyone—but all of our customers, whether they’re eyeing a .50-carat or a 16-carat diamond, want as much sparkle as their budget allows. Of the the 4Cs (cut, color, clarity, carat), cut has the greatest influence on a diamond’s beauty and sparkle. Even a diamond with a flawless clarity grade (no blemishes or inclusions) can look glassy or dull if the cut is too shallow or deep. So, when determining what diamond to buy, go with the best cut grade that you can afford.
Alter the code below.
# install and load the RColorBrewer package
# install.packages('RColorBrewer')
library(RColorBrewer)
# Adjust the code below to color the points by clarity.
# A layer called scale_color_brewer() has
# been added to adjust the legend and
# provide custom colors.
# See if you can figure out what it does.
# Links to resources are in the Instructor Notes.
# You will need to install the package RColorBrewer
# in R to get the same colors and color palettes.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = clarity), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1694 rows containing missing values (geom_point).
Response:
Holding carat weight constant, we’re looking at one part of the plot. We see the diamonds with lower clarity are almost always cheaper than diamonds with better clarity.
Alter the code below.
# Let’s look at cut and see if we find a similar result.
# Adjust the code below to color the points by cut.
# Change any other parts of the code as needed.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = cut), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1690 rows containing missing values (geom_point).
Response:
Despite what Blue Nile says, we don’t see much variation on cut. Most of the diamonds in the data are ideal cut anyway, so we’ve lost the color pattern that we saw before.
Alter the code below.
# Finally, let’s use diamond color to color our plot.
# Adjust the code below to color the points by diamond colors
# and change the titles.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(aes(color = color), alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = "Color", reverse = F,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1690 rows containing missing values (geom_point).
Response:
Color does seem to explain some of the variance in price. Just like we saw with the clarity variable. Blue now however, states that the difference between all color grades from D to J are basically not noticeable to the naked eye. Yet, we do see the color difference in the price tag.
Notes:
Use the lm() function lm(y ~ x)
Response:
Remember we applied the log transformation to our long tailed dollar variable, and we speculated that the flawless diamond should become exponentially rarer as diamond volume increases.
log(price) ~ carat^(1/3)
Notes:
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3, summary.stats=c("sigma","R-squared","F","p", "log-likelihood","Deviance", "N"))
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ============================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## --------------------------------------------------------------------------------------------
## sigma 0.280 0.259 0.250 0.224 0.129
## R-squared 0.924 0.935 0.939 0.951 0.984
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## N 53940 53940 53940 53940 53940
## ============================================================================================
## Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes:
Our model is lm(price) = 0.415 + 9.144 * carat^(1/3) - 1.093 * carat + (. * cut + . * color + . * clarity) + error
Research:
(Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response:
2008->2014 * inflation * 2008 global recession * Diamond market in China growing * uneven recovery price increase across different carat weight
Notes:
#install.package('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.Rda")
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.
# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).
# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!
# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.
# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)
# Your task is to write the code to create the models.
# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
# diamondsBigSample <- read.csv('diamondsBigSample.csv')
# sample 10,000 diamonds from the data set
# set.seed(20022012)
# diamondsbig_samp <- diamondsbig[sample(1:length(diamondsbig$price), 250000), ]
# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
diamondsbig$logprice <- log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)),
data = diamondsbig[diamondsbig$price < 10000 &
diamondsbig$cert == "GIA", ])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, summary.stats=c("sigma","R-squared","F","p", "log-likelihood","Deviance", "N"))
##
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color,
## data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
##
## ===============================================================================================
## m1 m2 m3 m4 m5
## -----------------------------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## -----------------------------------------------------------------------------------------------
## sigma 0.289 0.284 0.275 0.216 0.154
## R-squared 0.888 0.892 0.899 0.937 0.969
## F 2700903.714 1406538.330 754405.425 423311.488 521161.443
## p 0.000 0.000 0.000 0.000 0.000
## Deviance 28298.689 27291.534 25628.285 15874.910 7992.720
## N 338946 338946 338946 338946 338946
## ===============================================================================================
## Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
models <- mtable(m1, m2, m3, m4, m5)
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
## fit lwr upr
## 1 5040.436 3730.34 6810.638
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
dat = data.frame(m4$model, m4$residuals)
with(dat, sd(m4.residuals))
## [1] 0.2164168
with(subset(dat, carat > .9 & carat < 1.1), sd(m4.residuals))
## [1] 0.2178147
dat$resid <- as.numeric(dat$m4.residuals)
ggplot(aes(y = resid, x = round(carat, 2)), data = dat) +
geom_line(stat = "summary", fun.y = sd)
## Warning: Removed 1 rows containing missing values (geom_path).
Notes:
Data and models are never infallible and you can still get taken even equipped with this kind of analysis. There’s no substitute for establishing a personal connection and lasting business relationship with a jeweler you can trust.
We hope you feel prepared to explore data on your own, and answer questions that are interesting to you.
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!